Benchmarking Python Clustering Algorithms on 2D Data

Other notebooks perform a more general analysis of clustering algorithms; this notebook looks at the special case of two dimensional data. Why two dimensional? The algorithms underlying DBSCAN and HDBSCAN (in particular those built on kd-trees and ball trees) have significantly better performance and scaling in lower dimensions, so I wanted to see how scaling compares in what is effectively the best case scenario for such algorithms. As in the other notebooks we'll be looking at a number of algorithms and implementations.
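If you want to convince yourself of the low dimensional advantage, a quick sketch along the following lines (using sklearn's KDTree; the sizes and dimensions are arbitrary illustrative choices, not part of the benchmarks below) shows k-nearest-neighbour queries getting much slower as the dimension grows.

In [ ]:
import time
import numpy as np
from sklearn.neighbors import KDTree

# Time k-nearest-neighbour queries over a KDTree in low and high dimensions;
# in 2 dimensions the tree prunes most of the search space, while in 50
# dimensions it degrades towards brute force.
for dim in (2, 50):
    data = np.random.random(size=(10000, dim))
    tree = KDTree(data)
    start = time.time()
    tree.query(data, k=5)
    print('%d dimensions: %.2f seconds' % (dim, time.time() - start))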

The implementations being tested are:

  • Sklearn (which implements several algorithms):
    • K-Means clustering
    • DBSCAN clustering
    • Agglomerative clustering
    • Spectral clustering
    • Affinity Propagation
  • Scipy (which provides basic algorithms):
    • K-Means clustering
    • Agglomerative clustering
  • Fastcluster (which provides very fast agglomerative clustering in C++)
  • DeBaCl (Density Based Clustering; similar to a mix of DBSCAN and Agglomerative)
  • HDBSCAN (A robust hierarchical version of DBSCAN)

In [1]:
import hdbscan
import debacl
import fastcluster
import sklearn.cluster
import scipy.cluster
import sklearn.datasets
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_palette('Paired', 10)
sns.set_color_codes()

We can borrow the benchmarking code from other notebooks to benchmark multiple different sizes of dataset. Because some clustering algorithms have performance that can vary quite a lot depending on the exact nature of the dataset, we'll also need to run several times on randomly generated datasets of each size so as to get a better idea of the average case performance.

We also need to generalise over algorithms which don't necessarily all have the same API. We can resolve that by taking a clustering function, argument tuple and keywords dictionary to let us do semi-arbitrary calls (fortunately all the algorithms do at least take the dataset to cluster as the first parameter).

Finally some algorithms scale poorly, and I don't want to spend forever doing clustering of random datasets so we'll cap the maximum time an algorithm can use; once it has taken longer than max time we'll just abort there and leave the remaining entries in our datasize by samples matrix unfilled.

In the end this all amounts to a fairly straightforward set of nested loops (over datasizes and number of samples) with calls to sklearn to generate mock data and the clustering function inside a timer. Add in some early abort and we're done.

This time around we'll also set the default dataset dimension to be two.


In [2]:
def benchmark_algorithm(dataset_sizes, cluster_function, function_args, function_kwds,
                        dataset_dimension=2, max_time=45, sample_size=2):
    
    # Initialize the result with NaNs so that any unfilled entries 
    # will be considered NULL when we convert to a pandas dataframe at the end
    result = np.nan * np.ones((len(dataset_sizes), sample_size))
    for index, size in enumerate(dataset_sizes):
        dataset_n_clusters = int(np.log2(size))
        for s in range(sample_size):
            # Use sklearns make_blobs to generate a random dataset with specified size
            # dimension and number of clusters
            data, _ = sklearn.datasets.make_blobs(n_samples=size, 
                                                       n_features=dataset_dimension, 
                                                       centers=dataset_n_clusters)
            
            # Start the clustering with a timer
            start_time = time.time()
            cluster_function(data, *function_args, **function_kwds)
            time_taken = time.time() - start_time
            
            # If we are taking more than max_time then abort -- we don't
            # want to spend excessive time on slow algorithms
            if time_taken > max_time:
                result[index, s] = time_taken
                return pd.DataFrame(np.vstack([dataset_sizes.repeat(sample_size), 
                                               result.flatten()]).T, columns=['x','y'])
            else:
                result[index, s] = time_taken
        
    # Return the result as a dataframe for easier handling with seaborn afterwards
    return pd.DataFrame(np.vstack([dataset_sizes.repeat(sample_size), 
                                   result.flatten()]).T, columns=['x','y'])

Comparison of all ten implementations

Now we need a range of dataset sizes to test our algorithms over. Since the scaling performance is wildly different over the ten implementations we're going to look at, it will be beneficial to have a number of very small dataset sizes, with increasing spacing as we get larger, spanning out to 32000 datapoints to cluster (to begin with). Numpy provides convenient ways to get this done via arange and vector multiplication. We'll start with step sizes of 500, then shift to steps of 1000 past 3000 datapoints, and finally steps of 2000 past 6000 datapoints.


In [3]:
dataset_sizes = np.hstack([np.arange(1, 6) * 500, np.arange(3,7) * 1000, np.arange(4,17) * 2000])

Now it is just a matter of running all the clustering algorithms via our benchmark function to collect up all the requisite data. This could be prettier, rolled up into functions appropriately, but sometimes brute force is good enough. More importantly (for me), since this can take a significant amount of compute time, I wanted to be able to easily comment out algorithms that were slow or that I was uninterested in. Which brings me to a warning for you, the reader and potential user of this notebook: this next step is very expensive. We are running ten different clustering algorithms multiple times each on twenty two different dataset sizes -- and some of the clustering algorithms are slow (we are capping out at forty five seconds per run). That means that the next cell can take an hour or more to run. That doesn't mean "Don't try this at home" (I actually encourage you to try this out yourself and play with dataset parameters and clustering parameters), but it does mean you should be patient if you're going to run it!


In [4]:
k_means = sklearn.cluster.KMeans(10)
k_means_data = benchmark_algorithm(dataset_sizes, k_means.fit, (), {})

dbscan = sklearn.cluster.DBSCAN(eps=1.25)
dbscan_data = benchmark_algorithm(dataset_sizes, dbscan.fit, (), {})

scipy_k_means_data = benchmark_algorithm(dataset_sizes, scipy.cluster.vq.kmeans, (10,), {})

scipy_single_data = benchmark_algorithm(dataset_sizes, scipy.cluster.hierarchy.single, (), {})

fastclust_data = benchmark_algorithm(dataset_sizes, fastcluster.single, (), {})

hdbscan_ = hdbscan.HDBSCAN()
hdbscan_data = benchmark_algorithm(dataset_sizes, hdbscan_.fit, (), {})

debacl_data = benchmark_algorithm(dataset_sizes, debacl.geom_tree.geomTree, (5, 5), {'verbose':False})

agglomerative = sklearn.cluster.AgglomerativeClustering(10)
agg_data = benchmark_algorithm(dataset_sizes, agglomerative.fit, (), {}, sample_size=4)

spectral = sklearn.cluster.SpectralClustering(10)
spectral_data = benchmark_algorithm(dataset_sizes, spectral.fit, (), {}, sample_size=6)

affinity_prop = sklearn.cluster.AffinityPropagation()
ap_data = benchmark_algorithm(dataset_sizes, affinity_prop.fit, (), {}, sample_size=3)

Now we need to plot the results so we can see what is going on. The catch is that we have several datapoints for each dataset size, and ultimately we would like to fit a curve through all of them to get the general scaling trend. Fortunately seaborn comes to the rescue here by providing regplot, which plots a regression through a dataset, supports higher order regression (we should probably use order two, as most of these algorithms are effectively quadratic), and handles multiple datapoints for each x-value cleanly (using the x_estimator keyword to put a point at the mean and draw an error bar to cover the range of data).


In [5]:
sns.regplot(x='x', y='y', data=k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=hdbscan_data, order=2, label='HDBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=scipy_single_data, order=2, label='Scipy Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=debacl_data, order=2, label='DeBaCl Geom Tree', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=spectral_data, order=2, label='Sklearn Spectral', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=agg_data, order=2, label='Sklearn Agglomerative', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=ap_data, order=2, label='Sklearn Affinity Propagation', x_estimator=np.mean)
plt.gca().axis([0, 34000, 0, 120])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Clustering Implementations')
plt.legend()


Out[5]:
<matplotlib.legend.Legend at 0x10ce76850>

A few features stand out. First of all, there appear to be essentially two classes of implementation. The fast class consists of single linkage agglomerative clustering, K-Means, and DBSCAN. The slow cases are largely from sklearn and include agglomerative clustering (in this case using Ward instead of single linkage).

We really want to see what happens in the two dimensional case for the better scaling algorithms, so let's drop the slow algorithms, scale out a little further, and get a closer look at the implementations that managed 32000 points in under twenty seconds. There is almost undoubtedly more to learn as we move to ever larger dataset sizes.

Comparison of fast implementations

Let's compare the six fastest implementations now. We can scale out a little further as well; based on the curves above it looks like we should be able to comfortably get to 60000 data points without taking much more than a minute per run. We can also note that most of these implementations weren't that noisy so we can get away with a single run per dataset size.


In [6]:
large_dataset_sizes = np.arange(1,16) * 4000

hdbscan_prims = hdbscan.HDBSCAN(algorithm='prims_kdtree')
large_hdbscan_prims_data = benchmark_algorithm(large_dataset_sizes, 
                                         hdbscan_prims.fit, (), {}, max_time=90, sample_size=1)

hdbscan_boruvka = hdbscan.HDBSCAN(algorithm='boruvka_kdtree')
large_hdbscan_boruvka_data = benchmark_algorithm(large_dataset_sizes, 
                                         hdbscan_boruvka.fit, (), {}, max_time=90, sample_size=1)

k_means = sklearn.cluster.KMeans(10)
large_k_means_data = benchmark_algorithm(large_dataset_sizes, 
                                         k_means.fit, (), {}, max_time=90, sample_size=1)

dbscan = sklearn.cluster.DBSCAN(eps=1.25, min_samples=5)
large_dbscan_data = benchmark_algorithm(large_dataset_sizes, 
                                        dbscan.fit, (), {}, max_time=90, sample_size=1)

large_fastclust_data = benchmark_algorithm(large_dataset_sizes, 
                                           fastcluster.single, (), {}, max_time=90, sample_size=1)

large_scipy_k_means_data = benchmark_algorithm(large_dataset_sizes, 
                                               scipy.cluster.vq.kmeans, (10,), {}, max_time=90, sample_size=1)

large_scipy_single_data = benchmark_algorithm(large_dataset_sizes, 
                                              scipy.cluster.hierarchy.single, (), {}, max_time=90, sample_size=1)

Again we can use seaborn to do curve fitting and plotting, exactly as before.


In [7]:
sns.regplot(x='x', y='y', data=large_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_hdbscan_boruvka_data, order=2, label='HDBSCAN Boruvka', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_single_data, order=2, label='Scipy Single Linkage', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=large_hdbscan_prims_data, order=2, label='HDBSCAN Prims', x_estimator=np.mean)

plt.gca().axis([0, 64000, 0, 150])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Fastest Clustering Implementations')
plt.legend()


Out[7]:
<matplotlib.legend.Legend at 0x108721210>

Clearly something has gone woefully wrong with the curve fitting for the two single linkage algorithms, but what exactly? If we look at the raw data we can see what happened.


In [8]:
large_fastclust_data.tail(10)


Out[8]:
           x           y
5      24000    7.506189
6      28000   10.821918
7      32000   22.282744
8      36000   26.050475
9      40000  212.501709
10     44000         NaN
11     48000         NaN
12     52000         NaN
13     56000         NaN
14     60000         NaN

In [9]:
large_scipy_single_data.tail(10)


Out[9]:
           x           y
5      24000   10.324891
6      28000   15.216605
7      32000   21.523087
8      36000   29.564827
9      40000  174.514488
10     44000         NaN
11     48000         NaN
12     52000         NaN
13     56000         NaN
14     60000         NaN

It seems that at around 44000 points we hit a wall and the runtimes spiked. A hint as to why is that I'm running this on a laptop with 8GB of RAM. Both single linkage algorithms use scipy.spatial.pdist to compute pairwise distances between points, which returns a condensed distance array of n(n-1)/2 doubles. A quick computation shows that that array of distances is quite large once we have 44000 points:


In [10]:
size_of_array = 44000 * (44000 - 1) / 2         # from pdist documentation
bytes_in_array = size_of_array * 8              # Since doubles use 8 bytes
gigabytes_used = bytes_in_array / (1024.0 ** 3)  # divide out to get the number of GB
gigabytes_used


Out[10]:
7.211998105049133

Given that my laptop has to keep rather more than just that distance array in RAM, clearly we are going to spend time paging the distance array out to disk and back, and hence we see the runtimes increase dramatically as we become disk IO bound.
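Just to confirm the size claim itself, a quick check on a small, arbitrary example (not one of the benchmark runs) shows pdist handing back exactly one double per unordered pair of points:

In [ ]:
import numpy as np
from scipy.spatial.distance import pdist

# pdist returns a condensed 1-D array of doubles with one entry per pair of
# points; for 1000 points that is 1000*999/2 = 499500 entries.
small_data = np.random.random(size=(1000, 2))
distances = pdist(small_data)
print('%s %s' % (distances.shape, distances.dtype))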

For fastcluster there is an alternative linkage_vector function that can be used if you are doing single linkage and have the original data (as opposed to a matrix of dissimilarities). We'll rerun fastcluster with this instead.

For scipy single linkage, if we just leave off the last element we can get a better idea of the curves, but keep in mind that this single linkage implementation does not scale past a limit set by your available RAM.


In [11]:
large_fastclust_data = benchmark_algorithm(large_dataset_sizes, 
                                           fastcluster.linkage_vector, (), {}, max_time=90, sample_size=1)

Now we just need to replot with the new fastcluster data, and the truncated scipy single linkage data.


In [12]:
sns.regplot(x='x', y='y', data=large_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_hdbscan_boruvka_data, order=2, label='HDBSCAN Boruvka', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_fastclust_data, order=2, label='Fastcluster Single Linkage', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=large_scipy_single_data[:8], order=2, label='Scipy Single Linkage', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=large_hdbscan_prims_data, order=2, label='HDBSCAN Prims', x_estimator=np.mean)

plt.gca().axis([0, 64000, 0, 150])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of Fastest Clustering Implementations')
plt.legend()


Out[12]:
<matplotlib.legend.Legend at 0x104401d50>

This rules out scipy single linkage as being in the same class as the others, even if it could scale. Even fastcluster is starting to show the effects of being an $O(n^2)$ algorithm, despite its authors having worked very hard to improve the constant factors.

In practice this is going to mean that for larger datasets you are going to be very constrained in which algorithms you can apply: with enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. Let's take a closer look at the performance of these.

Comparison of K-Means and DBSCAN and HDBSCAN implementations

At this point we can scale out to 300000 datapoints easily enough: these algorithms can use spatial data structures to avoid computing the full pairwise distance matrix and thus potentially have much more favourable asymptotic complexity.
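As a rough illustration of that point (not one of the benchmark runs below; the dataset size and eps are arbitrary), you can force sklearn's DBSCAN to use brute force pairwise distances or a kd-tree via its algorithm parameter and compare:

In [ ]:
import time
import sklearn.cluster
import sklearn.datasets

# Compare DBSCAN using brute force pairwise distances against the kd-tree
# backed version on a modest two dimensional blob dataset.
data, _ = sklearn.datasets.make_blobs(n_samples=10000, n_features=2, centers=13)
for algorithm in ('brute', 'kd_tree'):
    start = time.time()
    sklearn.cluster.DBSCAN(eps=1.25, algorithm=algorithm).fit(data)
    print('%s: %.2f seconds' % (algorithm, time.time() - start))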


In [13]:
huge_dataset_sizes = np.arange(1,16) * 20000

k_means = sklearn.cluster.KMeans(10)
huge_k_means_data = benchmark_algorithm(huge_dataset_sizes, 
                                        k_means.fit, (), {}, max_time=120, sample_size=2)

dbscan = sklearn.cluster.DBSCAN()
huge_dbscan_data = benchmark_algorithm(huge_dataset_sizes, 
                                       dbscan.fit, (), {}, max_time=240, sample_size=2)

huge_scipy_k_means_data = benchmark_algorithm(huge_dataset_sizes, 
                                              scipy.cluster.vq.kmeans, (10,), {}, max_time=240, sample_size=2)

hdbscan_boruvka = hdbscan.HDBSCAN(algorithm='boruvka_kdtree')
huge_hdbscan_data = benchmark_algorithm(huge_dataset_sizes, 
                                       hdbscan_boruvka.fit, (), {}, max_time=240, sample_size=4)

huge_fastcluster_data = benchmark_algorithm(huge_dataset_sizes, 
                                            fastcluster.linkage_vector, (), {}, max_time=240, sample_size=2)

In [14]:
sns.regplot(x='x', y='y', data=huge_k_means_data, order=2, label='Sklearn K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_dbscan_data, order=2, label='Sklearn DBSCAN', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_scipy_k_means_data, order=2, label='Scipy K-Means', x_estimator=np.mean)
sns.regplot(x='x', y='y', data=huge_hdbscan_data, order=2, label='HDBSCAN', x_estimator=np.mean)
#sns.regplot(x='x', y='y', data=huge_fastcluster_data, order=2, label='Fastcluster', x_estimator=np.mean)


plt.gca().axis([0, 310000, 0, 60])
plt.gca().set_xlabel('Number of data points')
plt.gca().set_ylabel('Time taken to cluster (s)')
plt.title('Performance Comparison of K-Means and DBSCAN')
plt.legend()


Out[14]:
<matplotlib.legend.Legend at 0x10cf59d90>

While in other notebooks the two K-means implementations separated out, with the sklearn implementation clearly better, here they very much keep pace with one another. Also, interestingly, we see the full benefit of the dual tree Boruvka algorithm for minimum spanning trees, with the HDBSCAN implementation actually outperforming ordinary DBSCAN for large dataset sizes. In higher dimensions this won't be the case, but it makes it clear that if you have very low dimensional data then HDBSCAN (from which you can extract ordinary DBSCAN results if you so desire, as sketched below) is clearly the better choice.
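If you do want DBSCAN style results out of HDBSCAN, a minimal sketch of one approach (assuming your hdbscan version exposes the single_linkage_tree_ attribute; the cut distance here is an arbitrary illustrative value) is to cut the single linkage tree that HDBSCAN builds at a fixed epsilon:

In [ ]:
import hdbscan
import sklearn.datasets

# Fit HDBSCAN, then cut its single linkage tree at a fixed distance to get
# flat, DBSCAN*-style labels (label -1 marks noise points).
data, _ = sklearn.datasets.make_blobs(n_samples=2000, n_features=2, centers=10)
clusterer = hdbscan.HDBSCAN().fit(data)
flat_labels = clusterer.single_linkage_tree_.get_clusters(cut_distance=0.5,
                                                          min_cluster_size=5)
print('%d clusters found (label -1 marks noise)' % (flat_labels.max() + 1))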

But should I get a coffee?

So we know which implementations scale and which don't; a more useful thing to know in practice is, given a dataset, what can I run interactively? What can I run while I go and grab some coffee? How about a run over lunch? What if I'm willing to wait until I get in tomorrow morning? Each of these represents a significant break in productivity -- once you aren't working interactively anymore your productivity drops measurably, and so on.

We can build a table for this. To start we'll need to be able to approximate how long a given clustering implementation will take to run. Fortunately we already gathered a lot of that data; if we load up the statsmodels package we can fit the data (with a quadratic or $n\log n$ fit depending on the implementation) and use the resulting model to make our predictions. Obviously this has some caveats: if you fill your RAM with a distance matrix your runtime isn't going to fit the curve.

I've hand-built a time_samples list to give a reasonable set of potential data sizes that are nice and human readable. After that we just need a function to fit the data and build the prediction curves.


In [15]:
import statsmodels.formula.api as sm

time_samples = [1000, 2000, 5000, 10000, 25000, 50000, 75000, 100000, 250000, 500000, 750000,
               1000000, 2500000, 5000000, 10000000, 50000000, 100000000, 500000000, 1000000000]

def get_timing_series(data, quadratic=True):
    if quadratic:
        data['x_squared'] = data.x**2
        model = sm.ols('y ~ x + x_squared', data=data).fit()
        predictions = [model.params.dot([1.0, i, i**2]) for i in time_samples]
        return pd.Series(predictions, index=pd.Index(time_samples))
    else: # assume n log(n)
        data['xlogx'] = data.x * np.log(data.x)
        model = sm.ols('y ~ x + xlogx', data=data).fit()
        predictions = [model.params.dot([1.0, i, i*np.log(i)]) for i in time_samples]
        return pd.Series(predictions, index=pd.Index(time_samples))

Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\log n)$ algorithms utterly dominate here, and HDBSCAN achieves that complexity with the dual tree Boruvka on two dimensional data.


In [16]:
ap_timings = get_timing_series(ap_data)
spectral_timings = get_timing_series(spectral_data)
agg_timings = get_timing_series(agg_data)
debacl_timings = get_timing_series(debacl_data)
fastclust_timings = get_timing_series(large_fastclust_data.ix[:10,:].copy())
scipy_single_timings = get_timing_series(large_scipy_single_data.ix[:10,:].copy())
hdbscan_boruvka = get_timing_series(huge_hdbscan_data, quadratic=False)
#scipy_k_means_timings = get_timing_series(huge_scipy_k_means_data, quadratic=False)
dbscan_timings = get_timing_series(huge_dbscan_data, quadratic=False)
k_means_timings = get_timing_series(huge_k_means_data, quadratic=True)

timing_data = pd.concat([ap_timings, spectral_timings, agg_timings, debacl_timings, 
                            scipy_single_timings, fastclust_timings, hdbscan_boruvka, 
                            dbscan_timings, k_means_timings
                           ], axis=1)
timing_data.columns=['AffinityPropagation', 'Spectral', 'Agglomerative',
                                       'DeBaCl', 'ScipySingleLinkage', 'Fastcluster',
                                       'HDBSCAN', 'DBSCAN', 'SKLearn KMeans'
                                      ]
def get_size(series, max_time):
    return series.index[series < max_time].max()

datasize_table = pd.concat([
                            timing_data.apply(get_size, max_time=30),
                            timing_data.apply(get_size, max_time=300),
                            timing_data.apply(get_size, max_time=3600),
                            timing_data.apply(get_size, max_time=8*3600)
                            ], axis=1)
datasize_table.columns=('Interactive', 'Get Coffee', 'Over Lunch', 'Overnight')
datasize_table


Out[16]:
                     Interactive  Get Coffee  Over Lunch   Overnight
AffinityPropagation         2000        5000       25000       75000
Spectral                    5000       10000       50000      100000
Agglomerative               2000       10000       25000       75000
DeBaCl                      5000       10000       75000      100000
ScipySingleLinkage         25000       50000      100000      250000
Fastcluster               100000      250000     1000000     2500000
HDBSCAN                   500000     2500000    10000000   100000000
DBSCAN                    250000     1000000    10000000    50000000
SKLearn KMeans            750000     5000000    10000000    50000000

Conclusions

If you have two dimensional data then HDBSCAN is an excellent choice: it outperforms regular DBSCAN and is not too far behind K-Means.